
fix: Hardcode Legacy behavior to True to resolve warning. #446

Open · wants to merge 4 commits into main

Conversation

@Luka-D (Contributor) commented on Jan 22, 2025

Description of the change

This PR sets `legacy=True` in the AutoTokenizer. This keeps the current behavior of sft_trainer.py unchanged while preventing the following warning from appearing:

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.

More discussion on the nature of the legacy behavior can be found in huggingface/transformers#24565.

Even when `legacy` is not explicitly set, the tokenizer defaults it to True, so this change does not alter the current functionality.
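
For reference, a minimal sketch of the proposed change (the exact call site and arguments in sft_trainer.py may differ; the checkpoint name is only illustrative):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; sft_trainer.py would use the user-supplied model path.
model_name_or_path = "hf-internal-testing/llama-tokenizer"

# Passing legacy=True keeps today's default tokenizer behavior while
# suppressing the "default legacy behaviour" warning quoted above.
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, legacy=True)
```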

I have done some testing on the impact of the legacy behavior when tuning with sft_trainer and have included my results below.

Related issue number

Resolves a warning raised in Issue #1205.

How to verify the PR

Run tuning locally and verify that the warning message above no longer appears.
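
As a quick standalone check before a full tuning run, loading a Llama-family tokenizer with the flag set explicitly should no longer print the warning (the checkpoint below is only illustrative):

```python
from transformers import AutoTokenizer

# Without legacy set, loading a Llama tokenizer emits the legacy warning;
# with legacy=True passed explicitly, it should load without it.
tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer", legacy=True)
print(type(tok).__name__)
```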

Was the PR tested

I created one image with legacy=True and one with legacy=False and tested them using the Travis CI flow. The changes were tested on llama3 and granite, using both LoRA tuning and fine tuning. The results are graphed here:

[Graph: F1 micro scores for legacy=True vs legacy=False on llama3 and granite, with LoRA tuning and fine tuning]

The F1 micro score was identical for both models when using fine tuning. With LoRA tuning, both models showed a small improvement in F1 micro score when legacy was set to True; however, the difference is very small and may fall within the margin of error. We concluded that the results are effectively the same regardless of the legacy setting.

We also wondered whether the setting would change the EOS and BOS tokens, so I ran tuning locally and compared the tokenized outputs. The outputs were identical for both settings at 1 epoch and at 5 epochs. The tokenized output files are attached below for comparison, followed by a sketch of the check.

Legacy True 1 Epoch.txt
Legacy False 1 Epoch.txt
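
For reproducibility, here is a rough sketch of that comparison (the tokenizer checkpoint and sample text are illustrative, not the actual tuning data):

```python
from transformers import AutoTokenizer

checkpoint = "hf-internal-testing/llama-tokenizer"  # illustrative Llama tokenizer
sample = "### Input:\nHello world\n\n### Response:\nHi there"

tok_legacy = AutoTokenizer.from_pretrained(checkpoint, legacy=True)
tok_new = AutoTokenizer.from_pretrained(checkpoint, legacy=False)

ids_legacy = tok_legacy(sample)["input_ids"]
ids_new = tok_new(sample)["input_ids"]

# In our runs the tokenized outputs, including BOS/EOS handling, matched.
print("identical:", ids_legacy == ids_new)
print("bos/eos:", tok_legacy.bos_token, tok_legacy.eos_token)
```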

In conclusion, we determined that the impact of the legacy setting on the tokenizer is negligible. We decided to keep the current functionality but hardcode legacy=True so the warning no longer appears.

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Testing Legacy behavior set to True

Signed-off-by: Luka Dojcinovic <[email protected]>
Signed-off-by: Luka Dojcinovic <[email protected]>
We've decided to maintain the previous legacy behavior, coding it here to avoid that warning.

Signed-off-by: Luka Dojcinovic <[email protected]>

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@Luka-D changed the title from "Fix: Hardcode Legacy behavior to True to resolve warning." to "fix: Hardcode Legacy behavior to True to resolve warning." on Jan 22, 2025
@github-actions github-actions bot added the fix label Jan 22, 2025
@aluu317 (Collaborator) left a comment:

We had a discussion about keeping the same legacy tokenizer behavior. Explicitly setting the value resolves the warning. Looks good to me!

@willmj (Collaborator) left a comment:

LGTM
